<html>

<head>
<title>Rationale behind using JNI as opposed to threads in a remote JVM process</title>
    <style>
<!--
h1           { font-size: 18pt }
h2           { font-size: 14pt; margin-top: 12; margin-bottom: 3 }
p            { margin-top: 0; margin-bottom: 6 }
-->
    </style>
</head>

<body>
<h1>Rationale behind using JNI as opposed to threads in a remote JVM process.</h1>
<p><font size="2">Java&#8482; is a registered trademark of Sun Microsystems, Inc. in the United States and other countries.</font></p>
<h2>Reasons to use a high level language like Java&#8482; in the backend</h2>

<p>A large part of the reason why JNI was chosen in favor of an RPC based,
single JVM solution was due to the expected use-cases. Enterprise systems today are almost always
3-tier or n-tier. Database functions, triggers, and stored procedures are
mechanisms that extend the functionality of the backend tier. They typically
rely on a tight integration with the database due to a very high rate of
interactions and execute inside of the database largely to limit the number of
interactions between the middle tier and the backend tier. Some typical
use-cases: </p>

<ul type=disc>
 <li>Referential integrity
     enforcement. Using Java, referential integrity that goes beyond what can
     be provided using the standard SQL semantics can be provided. It might
     involve checking XML documents, enforcing some meta-driven rule system, or
     other complex tasks that put high demands on the implementation language. </li>
 <li>Advanced pattern recognition.
     Soundex, image comparison, etc. </li>
 <li>XML support functions. Java
     comes with a lot of XML support. Parsers etc. are readily available. </li>
 <li>Support functions for O/R
     mappers. A variety of support can be implemented depending on design. One
     example is an O/R mapper that allows methods on persistent objects. A lot
     can be gained if such methods are pushed down and executed within the
     database. Consider the following (OQL):

<p><code>
SELECT AVG(x.salary - x.computeTax()) FROM Employee x WHERE x.salary &gt; 120000;
</code></p>

<p>Pushing the computeTax logic down to the database instead of
computing it in the middle tier (where much or the O/R logic resides) is a huge
gain from a performance standpoint. The statement could be transformed into SQL
as: </p>

<p><code>
SELECT AVG(x.salary - computeTax(x.salary)) FROM Employee x WHERE x.salary &gt; 120000;
</code></p>

<p>As a result, very few interactions (typically only one) need
to be made between the middle and the backend tier. </p>
</ul>
<ul type=disc>
 <li>Views and indexes making use
     of computed values. In the above example and index could be created on
     computeTax(x.salary) and a view could express that as net_income. </li>
 <li>Message queue management.
     Delivering or fetching things using message queues or other delivery
     mechanisms. As with most interactions with other processes, this requires
     transaction coordination of some kind. </li>
</ul>

<p>One might argue that since a JVM often is present running an app-server in
the middle tier, would it not be more efficient if that JVM also executed the
database functions and triggers? In my opinion, this would be very bad. One
major reason for moving execution down to the database is performance (by
minimizing the number of roundtrips between the app-server and the database)
another is separation of concern. Referential data integrity and other ways to
extend the functionality of the database should not be the app-servers concern,
it belongs in the backend tier. Other aspects like database versus app-server
administration, replication of code and permission changes for functions, and
running different tiers on different servers, makes it even worse. </p>

<h2>Resource consumption</h2>

<p>Having one JVM per connection instead of one thread per connection running
in the same JVM will undoubtedly consume more resources. There are however a
couple of facts that must be remembered: </p>

<ul type=disc>
 <li>The overhead of multiple
     processes is already present due to the fact that each connection is a
     process in a PostgreSQL system. </li>
 <li>In order to keep connections
     separated in case they run in the same JVM, some kind of &quot;compartments&quot;
     must be created. Either you create them using parallel class loader chains
     (similar to how EAR files are managed in an EJB server) or you use a less
     protective model similar to a servlet engine. In order to get a separation
     that comparable to what you get using separate JVM's, you have to go for
     the former. That consumes some resources. </li>
 <li>The JVM has undergone a
     series of improvements in order to reduce footprint and startup time. Some
     significant improvements where made in Java 1.4 and Java 1.5 introduces
     Java Heap Self Tuning, Class Data Sharing, and Garbage Collector
     Ergonomics (read more <a
     href="http://java.sun.com/j2se/1.5.0/docs/relnotes/features.html#vm">here</a>),
     technologies that will minimize the startup time and make the JVM adopt
     its resource consumption in a much improved way.</li>
 <li>PL/Java can make use of the <a
     href="http://gcc.gnu.org/java">GCJ</a>. Using this technology, all core
     classes will be compiled into binaries and optionally pre-loaded by the
     postmaster. It also means that all modules that are loaded using the
     install_jar/replace_jar can be compiled into real shared objects. Finally,
     it means that the footprint for each &quot;JVM&quot; will be significantly
     decreased. </li>
</ul>

<h2>Connection pooling</h2>

<p>In the Java community you are very likely to use a connection pool. The pool
will ensure that the number of connections stays as low as possible and that
connections are reused (instead of closed and reestablished). New JVMs are
started rarely. </p>

<h2>Connection isolation</h2>

<p>Separate JVMs gives you a much higher degree of isolation. This brings a
number of advantages: </p>

<ul type=disc>
 <li>There's no problem attaching
     a debugger to one connection (one JVM) while the others run unaffected. </li>
 <li>There's no chance that one
     connection manages to accidentally (or maliciously) exchange dirty data
     with another connection. </li>
 <li>A process that performs tasks
     that consume a lot of CPU under a long period of time can be scheduled
     with a lower priority using a simple OS command. </li>
 <li>The JVMs can be brought down
     and restarted individually. </li>
 <li>Security policies are much
     easier to enforce. </li>
</ul>

<h2>Transaction visibility</h2>

<p>In order to maintain the correct visibility, the transaction must somehow be
propagated to the Java layer. I can see two solutions for this using RPC.
Either an XA aware JDBC-driver is used (requires XA support from PostgreSQL) or
a JDBC driver is written so that it calls back to the SPI functions in the
invoking process. Both choices results in an increased number of RPC calls and
a negative performance impact. </p>

<p>The PL/Java approach is to use the underlying SPI interfaces directly
through JNI by providing a &quot;pseudo connection&quot; that implements the
JDBC interfaces. The mapping is thus very direct. Data need never be serialized
nor duplicated. </p>

<h2>RPC performance</h2>

<p>Remote procedure calls are extremely expensive compared to in-process calls.
Relying on an RPC mechanism for Java calls will cripple the usefulness of such
an implementation a great deal. Here are two examples: </p>

<ul type=disc>
 <li>In order for an update
     trigger to function using RPC, you can choose one of two approaches.
     Either you limit the number of RPC calls and send two full Tuples (old and
     new) and a Tuple Descriptor to the remote JVM, and then pass a third Tuple
     (the modified new) back to the original, or you pass those structures by
     reference (as CORBA remote objects) and perform one RPC call each time you
     access them. You have a tradeoff between on one hand, limited
     functionality and poor performance, and on the other, good functionality
     and really bad performance. </li>
 <li>When one or several Java
     functions are used in the projection or filter of a SELECT statement on a
     query processing several thousand rows, each row will cause at least one
     call to Java. In case of RPC, this implies that the OS needs to do at
     least two context switches (back and forth) for each row in the query. </li>
</ul>

<p>Using JNI to directly access structures like TriggerData, Relation,
TupleDesc, and HeapTuple minimizes the amount of data that needs to be copied.
Parameters and return values that are primitives need not even become Java
objects. A 32-bit int4 Datum can be directly passed as a Java int (jint in
JNI). </p>

<h2>Simplicity</h2>

<p>I've have some experience of work involving CORBA and other RPCs. They add a
fair amount of complexity to the process. JNI however, is invisible to the
user. </p>

</body>

</html>
